1 00:00:11,077 --> 00:00:14,258 - Okay we have a lot to cover today so let's get started. 2 00:00:14,258 --> 00:00:17,454 Today we'll be talking about Generative Models. 3 00:00:17,454 --> 00:00:20,484 And before we start, a few administrative details. 4 00:00:20,484 --> 00:00:23,522 So midterm grades will be released on Gradescope this week 5 00:00:23,522 --> 00:00:27,730 A reminder that A3 is due next Friday May 26th. 6 00:00:27,730 --> 00:00:32,709 The HyperQuest deadline for extra credit you can do this still until Sunday May 21st. 7 00:00:33,632 --> 00:00:37,799 And our poster session is June 6th from 12 to 3 P.M.. 8 00:00:40,812 --> 00:00:47,759 Okay so an overview of what we're going to talk about today we're going to switch gears a little bit and take a look at unsupervised learning today. 9 00:00:47,759 --> 00:00:54,103 And in particular we're going to talk about generative models which is a type of unsupervised learning. 10 00:00:54,103 --> 00:00:57,112 And we'll look at three types of generative models. 11 00:00:57,112 --> 00:01:01,174 So pixelRNNs and pixelCNNs variational autoencoders 12 00:01:01,174 --> 00:01:04,174 and Generative Adversarial networks. 13 00:01:05,571 --> 00:01:11,168 So so far in this class we've talked a lot about supervised learning and different kinds of supervised learning problems. 14 00:01:11,168 --> 00:01:16,078 So in the supervised learning set up we have our data X and then we have some labels Y. 15 00:01:16,078 --> 00:01:21,417 And our goal is to learn a function that's mapping from our data X to our labels Y. 16 00:01:21,417 --> 00:01:26,237 And these labels can take many different types of forms. 17 00:01:26,237 --> 00:01:34,934 So for example, we've looked at classification where our input is an image and we want to output Y, a class label for the category. 18 00:01:34,934 --> 00:01:44,093 We've talked about object detection where now our input is still an image but here we want to output the bounding boxes of instances of up to multiple dogs or cats. 19 00:01:46,138 --> 00:01:51,986 We've talked about semantic segmentation where here we have a label for every pixel the category that every pixel belongs to. 20 00:01:53,572 --> 00:01:58,961 And we've also talked about image captioning where here our label is now a sentence 21 00:01:58,961 --> 00:02:02,961 and so it's now in the form of natural language. 22 00:02:03,998 --> 00:02:15,661 So unsupervised learning in this set up, it's a type of learning where here we have unlabeled training data and our goal now is to learn some underlying hidden structure of the data. 23 00:02:15,661 --> 00:02:20,370 Right, so an example of this can be something like clustering which you guys might have seen before 24 00:02:20,370 --> 00:02:25,029 where here the goal is to find groups within the data that are similar through some type of metric. 25 00:02:25,029 --> 00:02:27,187 For example, K means clustering. 26 00:02:27,187 --> 00:02:32,871 Another example of an unsupervised learning task is a dimensionality reduction. 27 00:02:33,777 --> 00:02:38,939 So in this problem want to find axes along which our training data has the most variation, 28 00:02:38,939 --> 00:02:43,537 and so these axes are part of the underlying structure of the data. 29 00:02:43,537 --> 00:02:51,095 And then we can use this to reduce of dimensionality of the data such that the data has significant variation among each of the remaining dimensions. 
30 00:02:51,095 --> 00:02:57,842 Right, so in this example here we start off with data in three dimensions and we're going to find two axes of variation in this case 31 00:02:57,842 --> 00:03:01,259 and reduce our data by projecting it down to 2D. 32 00:03:04,205 --> 00:03:09,964 Another example of unsupervised learning is learning feature representations for data. 33 00:03:11,006 --> 00:03:17,209 We've seen how to do this in supervised ways before where we used a supervised loss, for example classification. 34 00:03:17,209 --> 00:03:21,617 Where we have the classification label. We have something like a Softmax loss 35 00:03:21,617 --> 00:03:29,869 And we can train a neural network where we can interpret activations, for example our FC7 layer, as some kind of feature representation for the data. 36 00:03:29,869 --> 00:03:35,742 And in an unsupervised setting, for example here autoencoders which we'll talk more about later, 37 00:03:35,742 --> 00:03:46,872 in this case our loss is now trying to reconstruct the input data so that, basically, we have a good reconstruction of our input data, and we use this to learn features. 38 00:03:46,872 --> 00:03:52,245 So we're learning a feature representation without using any additional external labels. 39 00:03:53,471 --> 00:03:59,585 And finally another example of unsupervised learning is density estimation where in this case we want to 40 00:03:59,585 --> 00:04:02,884 estimate the underlying distribution of our data. 41 00:04:02,884 --> 00:04:10,811 So for example in this top case over here, we have points in 1-D and we can try and fit a Gaussian to this density 42 00:04:10,811 --> 00:04:16,605 and in this bottom example over here it's 2D data and here again we're trying to estimate the density and 43 00:04:16,605 --> 00:04:24,239 we can model this density. We want to fit a model such that the density is higher where there's more points concentrated. 44 00:04:26,100 --> 00:04:35,990 And so to summarize the differences: in supervised learning, which we've looked at a lot so far, we use labeled data to learn a function mapping from X to Y 45 00:04:35,990 --> 00:04:44,124 and in unsupervised learning we use no labels and instead we try to learn some underlying hidden structure of the data, whether this is groupings, 46 00:04:44,124 --> 00:04:48,291 axes of variation or the underlying density estimation. 47 00:04:49,662 --> 00:04:54,113 And unsupervised learning is a huge and really exciting area of research, 48 00:04:54,113 --> 00:05:04,339 and some of the reasons are that training data is really cheap, it doesn't use labels so we're able to learn from a lot of data at one time and basically utilize a lot 49 00:05:04,339 --> 00:05:09,977 more data than if we required annotating or finding labels for data. 50 00:05:09,977 --> 00:05:17,823 And unsupervised learning is still a relatively unsolved research area by comparison. There's a lot of open problems in this, 51 00:05:17,823 --> 00:05:24,669 but it also holds the potential that if you're able to successfully learn and represent a lot of the underlying structure 52 00:05:24,669 --> 00:05:32,729 in the data then this also takes you a long way towards the Holy Grail of trying to understand the structure of the visual world. 53 00:05:35,026 --> 00:05:40,432 So that's a little bit of a high-level big picture view of unsupervised learning.
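(A small aside for the notes: the dimensionality-reduction example above can be sketched in a few lines. This is a minimal illustration with made-up data, not code from the lecture; it centers the data, finds the principal axes of variation with an SVD, and projects 3-D points down to the top two axes.)

```python
import numpy as np

# Toy data: 500 points in 3-D whose variation lies mostly in a 2-D plane.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([2.0, 1.0, 0.1])

# Center the data, then find the principal axes of variation with an SVD.
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the top-2 axes: the 3-D data is reduced to 2-D.
X_2d = X_centered @ Vt[:2].T
print(X_2d.shape)  # (500, 2)
```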
54 00:05:40,432 --> 00:05:44,155 And today will focus more specifically on generative models 55 00:05:44,155 --> 00:05:52,933 which is a class of models for unsupervised learning where given training data our goal is to try and generate new samples from the same distribution. 56 00:05:52,933 --> 00:05:57,686 Right, so we have training data over here generated from some distribution P data 57 00:05:57,686 --> 00:06:04,955 and we want to learn a model, P model to generate samples from the same distribution 58 00:06:04,955 --> 00:06:09,854 and so we want to learn P model to be similar to P data. 59 00:06:09,854 --> 00:06:12,636 And generative models address density estimations. 60 00:06:12,636 --> 00:06:22,180 So this problem that we saw earlier of trying to estimate the underlying distribution of your training data which is a core problem in unsupervised learning. 61 00:06:22,180 --> 00:06:25,190 And we'll see that there's several flavors of this. 62 00:06:25,190 --> 00:06:33,353 We can use generative models to do explicit density estimation where we're going to explicitly define and solve for our P model 63 00:06:35,045 --> 00:06:37,610 or we can also do implicit density estimation 64 00:06:37,610 --> 00:06:45,035 where in this case we'll learn a model that can produce samples from P model without explicitly defining it. 65 00:06:47,700 --> 00:06:54,096 So, why do we care about generative models? Why is this a really interesting core problem in unsupervised learning? 66 00:06:54,096 --> 00:06:57,451 Well there's a lot of things that we can do with generative models. 67 00:06:57,451 --> 00:07:04,659 If we're able to create realistic samples from the data distributions that we want we can do really cool things with this, right? 68 00:07:04,659 --> 00:07:14,568 We can generate just beautiful samples to start with so on the left you can see a completely new samples of just generated by these generative models. 69 00:07:14,568 --> 00:07:21,042 Also in the center here generated samples of images we can also do tasks like super resolution, 70 00:07:21,042 --> 00:07:32,145 colorization so hallucinating or filling in these edges with generated ideas of colors and what the purse should look like. 71 00:07:32,145 --> 00:07:41,619 We can also use generative models of time series data for simulation and planning and so this will be useful in for reinforcement learning applications 72 00:07:41,619 --> 00:07:45,089 which we'll talk a bit more about reinforcement learning in a later lecture. 73 00:07:45,089 --> 00:07:50,261 And training generative models can also enable inference of latent representations. 74 00:07:50,261 --> 00:07:57,435 Learning latent features that can be useful as general features for downstream tasks. 75 00:07:59,059 --> 00:08:05,688 So if we look at types of generative models these can be organized into the taxonomy here 76 00:08:05,688 --> 00:08:13,180 where we have these two major branches that we talked about, explicit density models and implicit density models. 77 00:08:13,180 --> 00:08:19,062 And then we can also get down into many of these other sub categories. 78 00:08:19,062 --> 00:08:27,814 And well we can refer to this figure is adapted from a tutorial on GANs from Ian Goodfellow 79 00:08:27,814 --> 00:08:36,861 and so if you're interested in some of these different taxonomy and categorizations of generative models this is a good resource that you can take a look at. 
80 00:08:36,861 --> 00:08:45,645 But today we're going to discuss three of the most popular types of generative models that are in use and in research today. 81 00:08:45,645 --> 00:08:49,475 And so we'll talk first briefly about pixelRNNs and CNNs 82 00:08:49,475 --> 00:08:52,162 And then we'll talk about variational autoencoders. 83 00:08:52,162 --> 00:08:55,661 These are both types of explicit density models. 84 00:08:55,661 --> 00:08:57,494 One that's using a tractable density 85 00:08:57,494 --> 00:09:01,312 and another that's using an approximate density 86 00:09:01,312 --> 00:09:05,614 And then we'll talk about generative adversarial networks, 87 00:09:05,614 --> 00:09:09,781 GANs which are a type of implicit density estimation. 88 00:09:12,152 --> 00:09:16,304 So let's first talk about pixelRNNs and CNNs. 89 00:09:16,304 --> 00:09:20,015 So these are a type of fully visible belief networks 90 00:09:20,015 --> 00:09:22,432 which are modeling a density explicitly 91 00:09:22,432 --> 00:09:34,941 so in this case what they do is we have this image data X that we have and we want to model the probability or likelihood of this image P of X. Right and so in this case, for these kinds of models, 92 00:09:34,941 --> 00:09:40,384 we use the chain rule to decompose this likelihood into a product of one dimensional distribution. 93 00:09:40,384 --> 00:09:43,493 So we have here the probability of each pixel X I 94 00:09:43,493 --> 00:09:47,871 conditioned on all previous pixels X1 through XI - 1. 95 00:09:47,871 --> 00:09:58,073 and your likelihood all right, your joint likelihood of all the pixels in your image is going to be the product of all of these pixels together, all of these likelihoods together. 96 00:09:58,073 --> 00:10:08,938 And then once we define this likelihood, in order to train this model we can just maximize the likelihood of our training data under this defined density. 97 00:10:10,980 --> 00:10:20,833 So if we look at this this distribution over pixel values right, we have this P of XI given all the previous pixel values, well this is a really complex distribution. 98 00:10:20,833 --> 00:10:22,700 So how can we model this? 99 00:10:22,700 --> 00:10:29,042 Well we've seen before that if we want to have complex transformations we can do these using neural networks. 100 00:10:29,042 --> 00:10:32,828 Neural networks are a good way to express complex transformations. 101 00:10:32,828 --> 00:10:42,300 And so what we'll do is we'll use a neural network to express this complex function that we have of the distribution. 102 00:10:43,235 --> 00:10:44,796 And one thing you'll see here is that, 103 00:10:44,796 --> 00:10:51,212 okay even if we're going to use a neural network for this another thing we have to take care of is how do we order the pixels. 104 00:10:51,212 --> 00:10:58,886 Right, I said here that we have a distribution for P of XI given all previous pixels but what does all previous the pixels mean? 105 00:10:58,886 --> 00:11:01,303 So we'll take a look at that. 106 00:11:03,336 --> 00:11:06,669 So PixelRNN was a model proposed in 2016 107 00:11:07,595 --> 00:11:17,657 that basically defines a way for setting up and optimizing this problem and so how this model works is 108 00:11:17,657 --> 00:11:21,187 that we're going to generate pixels starting in a corner of the image. 
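(A note for the reader: the chain-rule factorization just described, in the standard notation where x_i is the i-th pixel under the chosen ordering, is

$$ p_\theta(x) \;=\; \prod_{i=1}^{n} p_\theta\!\left(x_i \mid x_1, \ldots, x_{i-1}\right), $$

and training maximizes $\sum_{x \in \text{training data}} \log p_\theta(x)$.)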
109 00:11:21,187 --> 00:11:31,050 So we can look at this grid as basically the pixels of your image and so what we're going to do is start from the pixel in the upper left-hand corner 110 00:11:31,050 --> 00:11:37,195 and then we're going to sequentially generate pixels based on these connections from the arrows that you can see here. 111 00:11:37,195 --> 00:11:44,332 And each of the dependencies on the previous pixels in this ordering is going to be modeled using an RNN, 112 00:11:44,332 --> 00:11:48,092 or more specifically an LSTM which we've seen before in lecture. 113 00:11:48,092 --> 00:11:55,242 Right so using this we can basically continue to move forward, just moving down along this diagonal 114 00:11:55,242 --> 00:12:01,244 and generating all of these pixel values dependent on the pixels that they're connected to. 115 00:12:01,244 --> 00:12:08,736 And so this works really well but the drawback here is this sequential generation, right, it's actually quite slow to do this. 116 00:12:08,736 --> 00:12:15,061 You can imagine, you know, if you're going to generate a new image, instead of the feed forward networks that we've seen with CNNs, 117 00:12:15,061 --> 00:12:20,952 here we're going to have to iteratively go through and generate all of these pixels. 118 00:12:24,044 --> 00:12:30,575 So a little bit later, after pixelRNN, another model called pixelCNN was introduced. 119 00:12:30,575 --> 00:12:34,570 And this has a very similar setup to pixelRNN, 120 00:12:34,570 --> 00:12:43,074 and we're still going to do this image generation starting from the corner of the image and expanding outwards, but the difference now is that instead of using 121 00:12:43,074 --> 00:12:47,752 an RNN to model all these dependencies we're going to use a CNN instead. 122 00:12:47,752 --> 00:12:52,179 And we're now going to use a CNN over a context region 123 00:12:52,179 --> 00:12:56,384 that you can see here around the particular pixel that we're going to generate now. 124 00:12:56,384 --> 00:13:09,313 Right so we take the pixels around it, this gray area within the region that's already been generated, and then we can pass this through a CNN and use that to generate our next pixel value. 125 00:13:11,041 --> 00:13:18,055 And so what this is going to give is a CNN, a neural network, at each pixel location, 126 00:13:18,055 --> 00:13:22,967 right, and so the output of this is going to be a softmax over the pixel values here. 127 00:13:22,967 --> 00:13:31,193 In this case we have values 0 to 255, and then we can train this by maximizing the likelihood of the training images. 128 00:13:31,193 --> 00:13:43,482 Right so we say that basically we want to take a training image, we're going to do this generation process, and at each pixel location we have the ground truth 129 00:13:43,482 --> 00:13:53,976 training data image value that we have here, and this is basically the label, the classification label, that we want our pixel to be, which of these 256 values, 130 00:13:53,976 --> 00:13:56,723 and we can train this using a Softmax loss. 131 00:13:56,723 --> 00:14:05,597 Right and so basically the effect of doing this is that we're going to maximize the likelihood of our training data pixels being generated. 132 00:14:05,597 --> 00:14:08,413 Okay any questions about this? Yes.
133 00:14:08,413 --> 00:14:12,159 [student's words obscured due to lack of microphone] 134 00:14:12,159 --> 00:14:18,675 Yeah, so the question is, I thought we were talking about unsupervised learning, why do we have basically a classification label here? 135 00:14:18,675 --> 00:14:24,970 The reason is that this loss, this output that we have, is the value of the input training data. 136 00:14:24,970 --> 00:14:26,983 So we have no external labels, right? 137 00:14:26,983 --> 00:14:38,533 We didn't go and have to manually collect any labels for this, we're just taking our input data and saying that this is what we use for the loss function. 138 00:14:41,199 --> 00:14:45,366 [student's words obscured due to lack of microphone] 139 00:14:47,998 --> 00:14:50,746 The question is, is this like bag of words? 140 00:14:50,746 --> 00:14:53,109 I would say it's not really bag of words, 141 00:14:53,109 --> 00:15:01,466 it's more saying that we're outputting a distribution over pixel values at each location of our image, right, and what we want to do 142 00:15:01,466 --> 00:15:10,442 is we want to maximize the likelihood of our input, our training data, being produced, being generated. 143 00:15:10,442 --> 00:15:15,761 Right so, in that sense, this is why it's using our input data to create our loss. 144 00:15:21,006 --> 00:15:24,904 So with pixelCNN, training is faster than with pixelRNN 145 00:15:24,904 --> 00:15:34,301 because here now, right, at every pixel location we want to maximize the likelihood of our training data 146 00:15:34,301 --> 00:15:40,739 showing up, and so we have all of these values already, right, just from our training data, and so we can do this much 147 00:15:40,739 --> 00:15:47,296 faster, but at generation time, at test time, we want to generate a completely new image, right, just starting from 148 00:15:47,296 --> 00:15:59,197 the corner, and we're not trying to do any type of learning, so at generation time we still have to generate each of these pixel locations before we can generate the next location. 149 00:15:59,197 --> 00:16:03,025 And so generation here is still slow even though training time is faster. 150 00:16:03,025 --> 00:16:04,204 Question. 151 00:16:04,204 --> 00:16:08,365 [student's words obscured due to lack of microphone] 152 00:16:08,365 --> 00:16:14,077 So the question is, is the distribution this training learns sensitive to what you pick for the first pixel? 153 00:16:14,077 --> 00:16:21,208 Yeah, so it is dependent on what you have as the initial pixel distribution, and then everything is conditioned based on that. 154 00:16:23,203 --> 00:16:32,171 So again, how do you pick this distribution? So at training time you have these distributions from your training data, and then at generation time 155 00:16:32,171 --> 00:16:38,368 you can just initialize this with either a uniform distribution or from your training data, however you want. 156 00:16:38,368 --> 00:16:42,553 And then once you have that, everything else is conditioned based on that. 157 00:16:42,553 --> 00:16:43,912 Question. 158 00:16:43,912 --> 00:16:48,079 [student's words obscured due to lack of microphone] 159 00:17:07,415 --> 00:17:14,146 Yeah so the question is, is there a way to define this other than in this chain rule fashion, by instead predicting all the pixels at one time?
160 00:17:14,146 --> 00:17:17,884 And so we'll see, we'll see models later that do do this, 161 00:17:17,884 --> 00:17:27,868 but what the chain rule allows us to do is to define this very tractable density that we can then optimize, directly optimizing the likelihood. 162 00:17:31,864 --> 00:17:39,606 Okay so these are some examples of generations from this model and so here on the left you can see 163 00:17:39,606 --> 00:17:48,846 generations where the training data is CIFAR-10, the CIFAR-10 dataset. And so you can see that in general they are starting to capture statistics of natural images. 164 00:17:48,846 --> 00:17:56,848 You can see general types of blobs and kind of things that look like parts of natural images coming out. 165 00:17:56,848 --> 00:18:02,768 On the right here it's ImageNet, we can again see samples from here and these are starting to look like natural images 166 00:18:05,060 --> 00:18:09,966 but they're still not there, there's still room for improvement. 167 00:18:09,966 --> 00:18:17,059 You can still see that there are obviously differences with the original training images and some of the semantics are not clear in here. 168 00:18:19,371 --> 00:18:27,020 So, to summarize this, pixelRNNs and CNNs allow you to explicitly compute the likelihood P of X. 169 00:18:27,020 --> 00:18:29,297 It's an explicit density that we can optimize. 170 00:18:29,297 --> 00:18:34,043 And being able to do this also has another benefit of giving a good evaluation metric. 171 00:18:34,043 --> 00:18:40,958 You know, you can kind of measure how good your samples are by this likelihood of the data that you can compute. 172 00:18:40,958 --> 00:18:47,043 And it's able to produce pretty good samples, but it's still an active area of research 173 00:18:47,043 --> 00:18:53,760 and the main disadvantage of these methods is that the generation is sequential and so it can be pretty slow. 174 00:18:53,760 --> 00:18:59,324 And these kinds of methods have also been used for generating audio, for example. 175 00:18:59,324 --> 00:19:08,170 And you can look online for some pretty interesting examples of this, but again the drawback is that it takes a long time to generate these samples. 176 00:19:08,170 --> 00:19:14,565 And so there's been a lot of work since then on improving pixelCNN performance, 177 00:19:14,565 --> 00:19:22,346 so all kinds of different architecture changes, changes to the loss function, formulating this differently, and different types of training tricks. 178 00:19:22,346 --> 00:19:29,495 And so if you're interested in learning more about this you can look at some of these papers on PixelCNN 179 00:19:29,495 --> 00:19:35,115 and then PixelCNN++, an improved version that came out this year. 180 00:19:37,455 --> 00:19:44,321 Okay so now we're going to talk about another type of generative model called variational autoencoders. 181 00:19:44,321 --> 00:19:52,204 And so far we saw that pixelCNNs defined a tractable density function, right, using this definition, 182 00:19:52,204 --> 00:19:58,365 and based on that we can directly optimize the likelihood of the training data. 183 00:19:59,419 --> 00:20:04,195 So with variational autoencoders now we're going to define an intractable density function. 184 00:20:04,195 --> 00:20:10,769 We're now going to model this with an additional latent variable Z and we'll talk in more detail about how this looks.
185 00:20:10,769 --> 00:20:17,886 And so our data likelihood P of X is now basically has to be this integral right, 186 00:20:17,886 --> 00:20:21,422 taking the expectation over all possible values of Z. 187 00:20:21,422 --> 00:20:26,909 And so this now is going to be a problem. We'll see that we cannot optimize this directly. 188 00:20:26,909 --> 00:20:33,706 And so instead what we have to do is we have to derive and optimize a lower bound on the likelihood instead. 189 00:20:33,706 --> 00:20:34,956 Yeah, question. 190 00:20:35,864 --> 00:20:37,592 So the question is is what is Z? 191 00:20:37,592 --> 00:20:42,862 Z is a latent variable and I'll go through this in much more detail. 192 00:20:44,479 --> 00:20:48,538 So let's talk about some background first. 193 00:20:48,538 --> 00:20:54,733 Variational autoencoders are related to a type of unsupervised learning model called autoencoders. 194 00:20:54,733 --> 00:21:00,965 And so we'll talk little bit more first about autoencoders and what they are and then I'll explain how variational 195 00:21:00,965 --> 00:21:05,851 autoencoders are related and build off of this and allow you to generate data. 196 00:21:05,851 --> 00:21:09,168 So with autoencoders we don't use this to generate data, 197 00:21:09,168 --> 00:21:15,719 but it's an unsupervised approach for learning a lower dimensional feature representation from unlabeled training data. 198 00:21:15,719 --> 00:21:21,550 All right so in this case we have our input data X and then we're going to want to learn some features that we call Z. 199 00:21:22,541 --> 00:21:29,605 And then we'll have an encoder that's going to be a mapping, a function mapping from this input data to our feature Z. 200 00:21:30,911 --> 00:21:33,905 And this encoder can take many different forms right, 201 00:21:33,905 --> 00:21:41,239 they would generally use neural networks so originally these models have been around, autoencoders have been around for a long time. 202 00:21:41,239 --> 00:21:45,803 So in the 2000s we used linear layers of non-linearities, 203 00:21:45,803 --> 00:21:54,389 then later on we had fully connected deeper networks and then after that we moved on to using CNNs for these encoders. 204 00:21:55,385 --> 00:22:01,351 So we take our input data X and then we map this to some feature Z. 205 00:22:01,351 --> 00:22:11,817 And Z we usually have as, we usually specify this to be smaller than X and we perform basically dimensionality reduction because of that. 206 00:22:11,817 --> 00:22:17,729 So the question who has an idea of why do we want to do dimensionality reduction here? 207 00:22:17,729 --> 00:22:20,896 Why do we want Z to be smaller than X? 208 00:22:22,114 --> 00:22:25,497 Yeah. [student's words obscured due to lack of microphone] 209 00:22:25,497 --> 00:22:31,657 So the answer I heard is Z should represent the most important features in X and that's correct. 210 00:22:32,634 --> 00:22:41,758 So we want Z to be able to learn features that can capture meaningful factors of variation in the data. Right this makes them good features. 211 00:22:42,833 --> 00:22:46,717 So how can we learn this feature representation? 212 00:22:46,717 --> 00:22:55,944 Well the way autoencoders do this is that we train the model such that the features can be used to reconstruct our original data. 213 00:22:55,944 --> 00:23:03,730 So what we want is we want to have input data that we use an encoder to map it to some lower dimensional features Z. 
214 00:23:05,320 --> 00:23:06,926 This is the output of the encoder network, 215 00:23:06,926 --> 00:23:16,554 and we want to be able to take these features that were produced based on this input data and then use a decoder a second network and be able to output now something 216 00:23:16,554 --> 00:23:24,865 of the same size dimensionality as X and have it be similar to X right so we want to be able to reconstruct the original data. 217 00:23:26,387 --> 00:23:38,583 And again for the decoder we are basically using same types of networks as encoders so it's usually a little bit symmetric and now we can use CNN networks for most of these. 218 00:23:41,675 --> 00:23:48,720 Okay so the process is going to be we're going to take our input data right we pass it through our encoder first 219 00:23:48,720 --> 00:23:53,996 which is going to be something for example like a four layer convolutional network and then we're going to pass it, 220 00:23:53,996 --> 00:24:04,196 get these features and then we're going to pass it through a decoder which is a four layer for example upconvolutional network and then get a reconstructed data out at the end of this. 221 00:24:04,196 --> 00:24:14,409 Right in the reason why we have a convolutional network for the encoder and an upconvolutional network for the decoder is because at the encoder we're basically 222 00:24:14,409 --> 00:24:25,893 taking it from this high dimensional input to these lower dimensional features and now we want to go the other way go from our low dimensional features back out to our high dimensional reconstructed input. 223 00:24:28,906 --> 00:24:39,071 And so in order to get this effect that we said we wanted before of being able to reconstruct our input data we'll use something like an L2 loss function. 224 00:24:39,071 --> 00:24:49,306 Right that basically just says let me make my pixels of my input data to be the same as my, my pixels in my reconstructed data to be the same as the pixels of my input data. 225 00:24:51,078 --> 00:24:58,599 An important thing to notice here, this relates back to a question that we had earlier, is that even though we have this loss function here, 226 00:24:58,599 --> 00:25:02,515 there's no, there's no external labels that are being used in training this. 227 00:25:02,515 --> 00:25:10,861 All we have is our training data that we're going to use both to pass through the network as well as to compute our loss function. 228 00:25:13,346 --> 00:25:19,021 So once we have this after training this model what we can do is we can throw away this decoder. 229 00:25:19,021 --> 00:25:26,108 All this was used was too to be able to produce our reconstruction input and be able to compute our loss function. 230 00:25:26,108 --> 00:25:34,819 And we can use the encoder that we have which produces our feature mapping and we can use this to initialize a supervised model. 231 00:25:34,819 --> 00:25:45,773 Right and so for example we can now go from this input to our features and then have an additional classifier network on top of this that now we can use to output 232 00:25:45,773 --> 00:25:55,601 a class label for example for classification problem we can have external labels from here and use our standard loss functions like Softmax. 233 00:25:55,601 --> 00:26:04,449 And so the value of this is that we basically were able to use a lot of unlabeled training data to try and learn good general feature representations. 
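(A minimal code sketch of the autoencoder pipeline just described, for the notes. The architecture and layer sizes here are illustrative assumptions, not the exact networks from the lecture; the point is the structure: a convolutional encoder mapping the image to a small feature z, an upconvolutional decoder mapping z back to the input size, and an L2 reconstruction loss that uses no external labels.)

```python
import torch
import torch.nn as nn

# Minimal convolutional autoencoder sketch (layer sizes are illustrative).
class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: high-dimensional image -> low-dimensional features z.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 64),                             # z has 64 dimensions
        )
        # Decoder: features z -> reconstruction of the same size as the input.
        self.decoder = nn.Sequential(
            nn.Linear(64, 32 * 7 * 7), nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = AutoEncoder()
x = torch.rand(8, 1, 28, 28)          # a batch of unlabeled images
x_hat = model(x)
loss = ((x_hat - x) ** 2).mean()      # L2 reconstruction loss, no external labels
loss.backward()
```

(After training, the decoder can be discarded and the encoder kept as an initialization for a supervised model, as described above.)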
234 00:26:04,449 --> 00:26:12,363 Right, and now we can use this to initialize a supervised learning problem where sometimes we don't have so much data, we only have a small amount of data. 235 00:26:12,363 --> 00:26:19,697 And we've seen in previous homeworks and classes that with small data it's hard to learn a model, right? 236 00:26:19,697 --> 00:26:22,563 You can have overfitting and all kinds of problems, 237 00:26:22,563 --> 00:26:27,540 and so this allows you to initialize your model first with better features. 238 00:26:31,371 --> 00:26:42,329 Okay so we saw that autoencoders are able to reconstruct data and are able to, as a result, learn features that we can use to initialize a supervised model. 239 00:26:42,329 --> 00:26:50,133 And we saw that these features that we learned have this intuition of being able to capture factors of variation in the training data. 240 00:26:50,133 --> 00:26:58,941 All right, so based on this intuition, we can have this latent vector Z which captures factors of variation in our training data. 241 00:26:58,941 --> 00:27:04,957 Now a natural question is, well, can we use a similar type of setup to generate new images? 242 00:27:06,922 --> 00:27:09,502 And so now we will talk about variational autoencoders, 243 00:27:09,502 --> 00:27:15,987 which is a probabilistic spin on autoencoders that will let us sample from the model in order to generate new data. 244 00:27:15,987 --> 00:27:19,404 Okay, any questions on autoencoders first? 245 00:27:20,796 --> 00:27:22,828 Okay, so variational autoencoders. 246 00:27:22,828 --> 00:27:28,914 All right, so here we assume that our training data, X I for I from one to N, 247 00:27:30,255 --> 00:27:34,812 is generated from some underlying, unobserved latent representation Z. 248 00:27:34,812 --> 00:27:38,357 Right, so it's this intuition that Z is some vector 249 00:27:38,357 --> 00:27:47,069 where each element of Z is capturing how little or how much of some factor of variation we have in our training data. 250 00:27:48,491 --> 00:27:54,811 Right so the intuition is, you know, maybe these could be something like different kinds of attributes. Let's say we're trying to generate faces, 251 00:27:54,811 --> 00:28:02,608 it could be how much of a smile is on the face, it could be the position of the eyebrows, hair, orientation of the head. 252 00:28:02,608 --> 00:28:08,772 These are all possible types of latent factors that could be learned. 253 00:28:08,772 --> 00:28:13,901 Right, and so our generation process is that we're going to sample from a prior over Z. 254 00:28:13,901 --> 00:28:25,014 Right so for each of these attributes, for example, you know, how much smile there is, we can have a prior over what sort of distribution we think there should be for this, so 255 00:28:25,014 --> 00:28:31,571 a Gaussian is something that's a natural prior that we can use for each of these factors of Z, 256 00:28:31,571 --> 00:28:40,140 and then we're going to generate our data X by sampling from a conditional distribution P of X given Z. 257 00:28:40,140 --> 00:28:48,862 So we sample Z first, we sample a value for each of these latent factors, and then we'll use that and sample our image X from here. 258 00:28:51,409 --> 00:28:57,667 And so the true parameters of this generation process are theta, theta star, right?
259 00:28:57,667 --> 00:29:03,158 So we have the parameters of our prior and our conditional distributions, 260 00:29:03,158 --> 00:29:11,727 and what we want to do, in order to have a generative model that's able to generate new data, is to estimate these true parameters. 261 00:29:14,790 --> 00:29:17,611 Okay so let's first talk about how we should represent this model. 262 00:29:20,282 --> 00:29:27,317 All right, so if we're going to have a model for this generative process, well we've already said before that we can choose our prior P of Z to be something simple. 263 00:29:27,317 --> 00:29:32,713 Something like a Gaussian, right? And this is a reasonable thing to choose for latent attributes. 264 00:29:35,696 --> 00:29:40,840 Now for our conditional distribution P of X given Z, this is much more complex, right, 265 00:29:40,840 --> 00:29:43,410 because we need to use this to generate an image, 266 00:29:43,410 --> 00:29:53,062 and so for P of X given Z, well as we saw before, when we have some type of complex function that we want to represent, we can represent this with a neural network. 267 00:29:53,062 --> 00:29:58,259 And so that's a natural choice: let's try and model P of X given Z with a neural network. 268 00:30:00,308 --> 00:30:02,345 And we're going to call this the decoder network. 269 00:30:02,345 --> 00:30:10,167 Right, so we're going to think about taking some latent representation and trying to decode this into the image that it's specifying. 270 00:30:10,167 --> 00:30:13,765 So now how can we train this model? 271 00:30:13,765 --> 00:30:19,419 Right, we want to be able to train this model so that we can learn an estimate of these parameters. 272 00:30:19,419 --> 00:30:26,668 So if we remember our strategy for training generative models, back from our fully visible belief networks, our pixelRNNs and CNNs, 273 00:30:28,577 --> 00:30:35,498 a straightforward natural strategy is to try and learn these model parameters in order to maximize the likelihood of the training data. 274 00:30:35,498 --> 00:30:39,346 Right, so we saw earlier that in this case, with our latent variable Z, we're going to have 275 00:30:39,346 --> 00:30:49,884 to write out P of X taking the expectation over all possible values of Z, which is continuous, and so we get this expression here. Right so now we have it with this latent Z, 276 00:30:49,884 --> 00:30:55,759 and now if we want to try and maximize this likelihood, well what's the problem? 277 00:30:55,759 --> 00:31:01,372 Can we just take gradients and maximize this likelihood? 278 00:31:01,372 --> 00:31:04,358 [student's words obscured due to lack of microphone] 279 00:31:04,358 --> 00:31:08,524 Right, so this integral is not going to be tractable, that's correct. 280 00:31:10,199 --> 00:31:12,547 So let's take a look at this in a little bit more detail. 281 00:31:12,547 --> 00:31:18,772 Right, so we have our data likelihood term here. And the first term is P of Z. 282 00:31:18,772 --> 00:31:24,847 And here we already said earlier, we can just choose this to be a simple Gaussian prior, so this is fine. 283 00:31:24,847 --> 00:31:29,031 P of X given Z, well we said we were going to specify a decoder neural network. 284 00:31:29,031 --> 00:31:32,774 So given any Z, we can get P of X given Z from here. 285 00:31:32,774 --> 00:31:35,721 It's the output of our neural network. 286 00:31:35,721 --> 00:31:38,147 But then what's the problem here?
287 00:31:38,147 --> 00:31:48,435 Okay this was supposed to be a different unhappy face but somehow I don't know what happened, in the process of translation, it turned into a crying black ghost 288 00:31:49,298 --> 00:31:58,591 but what this is symbolizing is that basically if we want to compute P of X given Z for every Z this is now intractable right, 289 00:31:59,519 --> 00:32:02,186 we cannot compute this integral. 290 00:32:04,794 --> 00:32:06,591 So data likelihood is intractable 291 00:32:06,591 --> 00:32:19,639 and it turns out that if we look at other terms in this model if we look at our posterior density, So P of our posterior of Z given X, then this is going to be P of X given Z 292 00:32:19,639 --> 00:32:23,712 times P of Z over P of X by Bayes' rule 293 00:32:23,712 --> 00:32:25,740 and this is also going to be intractable, right. 294 00:32:25,740 --> 00:32:35,143 We have P of X given Z is okay, P of Z is okay, but we have this P of X our likelihood which has the integral and it's intractable. 295 00:32:36,027 --> 00:32:37,993 So we can't directly optimizes this. 296 00:32:37,993 --> 00:32:45,230 but we'll see that a solution, a solution that will enable us to learn this model 297 00:32:45,230 --> 00:32:54,824 is if in addition to using a decoder network defining this neural network to model P of X given Z. If we now define an additional encoder network 298 00:32:54,824 --> 00:33:06,652 Q of Z given X we're going to call this an encoder because we want to turn our input X into, get the likelihood of Z given X, we're going to encode this into Z. 299 00:33:06,652 --> 00:33:10,329 And defined this network to approximate the P of Z given X. 300 00:33:12,388 --> 00:33:15,688 Right this was posterior density term now is also intractable. 301 00:33:15,688 --> 00:33:22,866 If we use this additional network to approximate this then we'll see that this will actually allow us to derive 302 00:33:22,866 --> 00:33:27,486 a lower bound on the data likelihood that is tractable and which we can optimize. 303 00:33:29,308 --> 00:33:35,396 Okay so first just to be a little bit more concrete about these encoder and decoder networks that I specified, 304 00:33:36,579 --> 00:33:40,695 in variational autoencoders we want the model probabilistic generation of data. 305 00:33:40,695 --> 00:33:51,530 So in autoencoders we already talked about this concept of having an encoder going from input X to some feature Z and a decoder network going from Z back out to some image X. 306 00:33:53,294 --> 00:33:58,907 And so here we go to again have an encoder network and a decoder network but we're going to make these probabilistic. 307 00:33:58,907 --> 00:34:06,134 So now our encoder network Q of Z given X with parameters phi are going to output a mean 308 00:34:06,134 --> 00:34:09,467 and a diagonal covariance and from here, 309 00:34:11,411 --> 00:34:14,795 this will be the direct outputs of our encoder network and the same thing for our 310 00:34:14,795 --> 00:34:23,109 decoder network which is going to start from Z and now it's going to output the mean and the diagonal covariance of some X, 311 00:34:23,109 --> 00:34:26,725 same dimension as the input given Z 312 00:34:26,725 --> 00:34:29,478 And then this decoder network has different parameters theta. 313 00:34:31,136 --> 00:34:42,058 And now in order to actually get our Z and our, This should be Z given X and X given Z. We'll sample from these distributions. 
314 00:34:42,058 --> 00:34:49,072 So now our encoder and our decoder network are producing distributions over Z and X respectively, 315 00:34:49,072 --> 00:34:52,409 and we'll sample from these distributions in order to get values from them. 316 00:34:52,409 --> 00:34:59,630 So you can see how this is taking us in the direction of being able to sample and generate new data. 317 00:34:59,630 --> 00:35:05,041 And just one thing to note is that for these encoder and decoder networks, you'll also hear different terms for them. 318 00:35:05,041 --> 00:35:09,138 The encoder network can also be called a recognition or inference network because 319 00:35:09,138 --> 00:35:15,913 we're trying to perform inference of this latent representation Z given X, and then the decoder 320 00:35:15,913 --> 00:35:18,826 network, this is what we'll use to perform generation. 321 00:35:18,826 --> 00:35:22,993 Right so you'll also hear generation network being used. 322 00:35:24,410 --> 00:35:31,899 Okay so now, equipped with our encoder and decoder networks, let's try and work out the data likelihood again, 323 00:35:31,899 --> 00:35:35,117 and we'll use the log of the data likelihood here. 324 00:35:35,117 --> 00:35:38,833 So we'll see that if we want the log of P of X, right, 325 00:35:38,833 --> 00:35:44,988 we can write this out as log of P of X but take the expectation with respect to Z. 326 00:35:44,988 --> 00:35:51,053 So Z is sampled from our distribution Q of Z given X that we've now defined using the encoder network. 327 00:35:52,606 --> 00:35:58,254 And we can do this because P of X doesn't depend on Z. Right, 'cause Z is not part of that. 328 00:35:58,254 --> 00:36:04,794 And so we'll see that taking the expectation with respect to Z is going to come in handy later on. 329 00:36:06,255 --> 00:36:20,564 Okay so now from this original expression we can expand it out to be log of P of X given Z times P of Z over P of Z given X, using Bayes' rule. And so this is just directly writing this out. 330 00:36:20,564 --> 00:36:24,996 And then taking this we can also now multiply it by a constant. 331 00:36:24,996 --> 00:36:30,874 Right, so Q of Z given X over Q of Z given X. This is one, so we can do this. 332 00:36:30,874 --> 00:36:33,847 It doesn't change anything, but it's going to be helpful later on. 333 00:36:33,847 --> 00:36:39,444 So given that, what we'll do is we'll write it out into these three separate terms. 334 00:36:39,444 --> 00:36:44,703 And you can work out this math later on by yourself, but it's essentially just using logarithm rules, 335 00:36:44,703 --> 00:36:54,728 taking all of these terms that we had in the line above and just separating it out into these three different terms that will have nice meanings. 336 00:36:56,431 --> 00:37:02,754 Right so if we look at this, the first term that we get separated out is the expectation 337 00:37:02,754 --> 00:37:07,210 of log of P of X given Z, and then we're going to have two KL terms, right. 338 00:37:07,210 --> 00:37:14,400 This is basically a KL divergence term that says how close these two distributions are. 339 00:37:14,400 --> 00:37:18,567 So how close is the distribution Q of Z given X to P of Z. 340 00:37:19,489 --> 00:37:24,287 So it's just the, it's exactly this expectation term above. 341 00:37:24,287 --> 00:37:28,454 And it's just a distance metric for distributions. 342 00:37:30,908 --> 00:37:36,183 And so we'll see that, right, we saw that these are nice KL terms that we can write out.
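(For the notes, the decomposition just derived, written out in the usual VAE notation:

$$ \log p_\theta(x) \;=\; \mathbb{E}_{z \sim q_\phi(z \mid x)}\!\big[\log p_\theta(x \mid z)\big] \;-\; D_{KL}\!\big(q_\phi(z \mid x)\,\|\,p(z)\big) \;+\; D_{KL}\!\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big). $$

The next part of the lecture walks through these three terms one by one.)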
343 00:37:36,183 --> 00:37:39,290 And now if we look at these three terms that we have here, 344 00:37:39,290 --> 00:37:45,819 the first term is P of X given Z, which is provided by our decoder network. 345 00:37:45,819 --> 00:37:52,042 And we're able to compute an estimate of this term through sampling, and we'll see that we can 346 00:37:52,042 --> 00:37:56,099 do sampling that's differentiable through something called the re-parametrization trick, which is a 347 00:37:56,099 --> 00:37:59,920 detail that you can look at in the paper if you're interested. 348 00:37:59,920 --> 00:38:02,479 But basically we can now compute this term. 349 00:38:02,479 --> 00:38:08,600 And then these KL terms: the second KL term is a KL between two Gaussians, 350 00:38:08,600 --> 00:38:16,079 so our Q of Z given X, remember our encoder produced this distribution which had a mean and a covariance, it was a nice Gaussian, 351 00:38:16,079 --> 00:38:19,892 and then also our prior P of Z, which is also a Gaussian. 352 00:38:19,892 --> 00:38:25,628 And so, when you have a KL of two Gaussians, you have a nice closed form solution for this. 353 00:38:25,628 --> 00:38:31,324 And then this third KL term now, this is a KL of Q of Z given X with P of Z given X. 354 00:38:32,303 --> 00:38:36,766 But we know that P of Z given X was this intractable posterior that we saw earlier, right? 355 00:38:36,766 --> 00:38:41,794 That we didn't want to compute, that's why we had this approximation using Q. 356 00:38:41,794 --> 00:38:44,625 And so this term is still a problem. 357 00:38:44,625 --> 00:38:54,776 But one thing we do know about this term is that a KL divergence, a distance between two distributions, is always greater than or equal to zero by definition. 358 00:38:57,060 --> 00:39:03,396 And so what we can do with this is say, well, what we have here, the two terms that we can work nicely with, this is 359 00:39:03,396 --> 00:39:10,023 a tractable lower bound which we can actually take the gradient of and optimize. 360 00:39:10,023 --> 00:39:16,652 P of X given Z is differentiable, and the KL term also, its closed form solution is also differentiable. 361 00:39:16,652 --> 00:39:24,168 And this is a lower bound because we know that the KL term on the right, the ugly one, is greater than or equal to zero. 362 00:39:24,168 --> 00:39:26,251 So we have a lower bound. 363 00:39:27,273 --> 00:39:37,699 And so what we'll do to train a variational autoencoder is that we take this lower bound and we optimize and maximize this lower bound instead. 364 00:39:37,699 --> 00:39:42,251 So we're optimizing a lower bound on the likelihood of our data. 365 00:39:42,251 --> 00:39:49,940 So that means that our data is always going to have a likelihood that's at least as high as this lower bound that we're maximizing. 366 00:39:49,940 --> 00:39:58,941 And so we want to find the parameters, estimate the parameters theta and phi, that allow us to maximize this. 367 00:40:03,169 --> 00:40:06,412 And then one last sort of intuition about this lower bound 368 00:40:06,412 --> 00:40:12,796 that we have is that this first term is an expectation over all samples of Z, 369 00:40:12,796 --> 00:40:22,699 sampled by passing our X through the encoder network and sampling Z, taking the expectation over all of these samples of the likelihood of X given Z, 370 00:40:24,963 --> 00:40:26,854 and so this is a reconstruction, right?
371 00:40:26,854 --> 00:40:33,300 This is basically saying, if I want this to be big, I want this likelihood P of X given Z to be high, 372 00:40:33,300 --> 00:40:37,756 so it's kind of like trying to do a good job reconstructing the data. 373 00:40:37,756 --> 00:40:40,528 So similar to what we had with our autoencoder before. 374 00:40:40,528 --> 00:40:44,695 But the second term here is saying make this KL small. 375 00:40:46,161 --> 00:40:51,283 Make our approximate posterior distribution close to our prior distribution. 376 00:40:51,283 --> 00:41:04,558 And this is basically saying that, well, we want our latent variable Z to follow the distribution shape that we would like it to have. 377 00:41:08,974 --> 00:41:12,058 Okay so any questions about this? 378 00:41:12,058 --> 00:41:19,128 I think this is a lot of math, and if you guys are interested you should go back and kind of work through all of the derivations yourself. 379 00:41:19,128 --> 00:41:19,961 Yeah. 380 00:41:20,883 --> 00:41:23,669 [student's words obscured due to lack of microphone] 381 00:41:23,669 --> 00:41:29,373 So the question is why do we specify the prior and the latent variables as Gaussian? 382 00:41:29,373 --> 00:41:33,512 And the reason is that, well, we're defining some sort of generative process, right, 383 00:41:33,512 --> 00:41:35,930 of sampling Z first and then sampling X. 384 00:41:35,930 --> 00:41:53,307 And defining it as a Gaussian is a reasonable type of prior; we can say it makes sense for these types of latent attributes to be distributed according to some sort of Gaussian, and then this lets us optimize our model. 385 00:41:55,988 --> 00:42:06,053 Okay, so we talked about how we can derive this lower bound, and now let's put this all together and walk through the process of training the VAE. 386 00:42:06,053 --> 00:42:10,008 Right so here's the bound that we want to optimize, to maximize. 387 00:42:10,008 --> 00:42:19,301 And now for a forward pass, we're going to proceed in the following manner. We have our input data X, so we'll take a mini batch of input data. 388 00:42:20,845 --> 00:42:26,544 And then we'll pass it through our encoder network, so we'll get Q of Z given X. 389 00:42:28,439 --> 00:42:35,805 And from this Q of Z given X, these will be the terms that we use to compute the KL term. 390 00:42:35,805 --> 00:42:46,856 And then from here we'll sample Z from this distribution of Z given X, so we have a sample of the latent factors that we can infer from X. 391 00:42:50,721 --> 00:42:54,889 And then from here we're going to pass Z through our second network, the decoder network. 392 00:42:54,889 --> 00:43:07,686 And from the decoder network we'll get this output for the mean and variance of our distribution for X given Z, and then finally we can sample our X given Z from this distribution, 393 00:43:07,686 --> 00:43:12,155 and here this will produce some sample output. 394 00:43:12,155 --> 00:43:23,517 And when we're training, we're going to take this distribution and say, well, our loss term is going to be the log likelihood of our training image pixel values given Z. 395 00:43:23,612 --> 00:43:30,684 So our loss function is going to say let's maximize the likelihood of this original input being reconstructed. 396 00:43:32,020 --> 00:43:35,919 And so now for every mini batch of input we're going to compute this forward pass.
397 00:43:35,919 --> 00:43:43,837 Get all these terms that we need, and then this is all differentiable, so we just backprop through all of this and get our gradient, 398 00:43:43,837 --> 00:43:57,040 we update our model and we use this to continuously update our parameters, our encoder and decoder network parameters phi and theta, in order to maximize the likelihood of the training data. 399 00:43:58,408 --> 00:44:05,547 Okay so once we've trained our VAE, now to generate data, what we can do is use just the decoder network. 400 00:44:05,547 --> 00:44:15,504 All right, so from here we can sample Z now, but instead of sampling Z from the posterior that we had during training, during generation we sample from our true generative process. 401 00:44:15,504 --> 00:44:18,673 So we sample from our prior that we specify. 402 00:44:18,673 --> 00:44:22,840 And then we're going to sample our data X from here. 403 00:44:25,281 --> 00:44:34,798 And we'll see that this can produce, in this case, trained on MNIST, these are samples of digits generated from a VAE trained on MNIST. 404 00:44:36,058 --> 00:44:43,796 And you can see that, you know, we talked about this idea of Z representing these latent factors, where we can 405 00:44:43,796 --> 00:44:52,625 vary Z, right, sampling from different parts of our prior, and then get different kinds of interpretable meanings from here. 406 00:44:52,625 --> 00:44:57,142 So here we can see that this is the data manifold for two dimensional Z. 407 00:44:57,142 --> 00:45:08,568 So if we have a two dimensional Z and we take Z in some range, you know, from different percentiles of the distribution, and we vary Z1 and we vary Z2, 408 00:45:08,568 --> 00:45:16,300 then you can see how the image generated from every combination of Z1 and Z2 that we have here, 409 00:45:16,300 --> 00:45:22,087 you can see it's transitioning smoothly across all of these different variations. 410 00:45:24,051 --> 00:45:27,808 And you know, our prior on Z was diagonal, 411 00:45:27,808 --> 00:45:43,006 so we chose this in order to encourage these to be independent latent variables that can then encode interpretable factors of variation. So because of this, we'll have different dimensions of Z encoding different interpretable factors of variation. 412 00:45:44,477 --> 00:45:54,771 So, in this example, trained now on faces, we'll see as we vary Z1, going up and down, you'll see the amount of smile changing. 413 00:45:54,771 --> 00:46:00,225 So from a frown at the top to like a big smile at the bottom, and then as we vary Z2, 414 00:46:01,997 --> 00:46:07,859 from left to right, you can see the head pose changing, from one direction all the way to the other. 415 00:46:09,883 --> 00:46:18,526 And so one additional thing I want to point out is that as a result of doing this, these Z variables are also good feature representations, 416 00:46:19,510 --> 00:46:26,376 because they encode how much of each of these different interpretable semantics we have. 417 00:46:26,376 --> 00:46:32,296 And so we can use our Q of Z given X, the encoder that we've learned, and give it an input 418 00:46:32,296 --> 00:46:42,249 image X, we can map this to Z and use Z as features for downstream tasks like supervised learning, for example classification, or other tasks. 419 00:46:47,348 --> 00:46:51,434 Okay so just another couple of examples of data generated from VAEs.
420 00:46:51,434 --> 00:47:02,231 So on the left here we have data generated on CIFAR-10, trained on CIFAR-10, and then on the right we have data trained and generated on Faces. 421 00:47:02,231 --> 00:47:08,737 And we'll see so we can see that in general VAEs are able to generate recognizable data. 422 00:47:08,737 --> 00:47:15,493 One of the main drawbacks of VAEs is that they tend to still have a bit of a blurry aspect to them. 423 00:47:15,493 --> 00:47:20,520 You can see this in the faces and so this is still an active area of research. 424 00:47:22,008 --> 00:47:28,030 Okay so to summarize VAEs, they're a probabilistic spin on traditional autoencoders. 425 00:47:28,030 --> 00:47:36,077 So instead of deterministically taking your input X and going to Z, feature Z and then back to reconstructing X, 426 00:47:36,077 --> 00:47:43,023 now we have this idea of distributions and sampling involved which allows us to generate data. 427 00:47:43,023 --> 00:47:51,101 And in order to train this, VAEs are defining an intractable density. So we can derive and optimize a lower bound, 428 00:47:51,101 --> 00:47:59,718 a variational lower bound, so variational means basically using approximations to handle these types of intractable expressions. 429 00:47:59,718 --> 00:48:03,577 And so this is why this is called a variational autoencoder. 430 00:48:03,577 --> 00:48:10,249 And so some of the advantages of this approach is that VAEs are, they're a principled approach 431 00:48:10,249 --> 00:48:17,628 to generative models and they also allow this inference query so being able to infer things like Q of Z given X. 432 00:48:17,628 --> 00:48:21,554 That we said could be useful feature representations for other tasks. 433 00:48:23,101 --> 00:48:29,548 So disadvantages of VAEs are that while we're maximizing the lower bound of the likelihood, which is okay 434 00:48:29,548 --> 00:48:37,782 like you know in general this is still pushing us in the right direction and there's more other theoretical analysis of this. 435 00:48:37,782 --> 00:48:48,378 So you know, it's doing okay, but it's maybe not still as direct an optimization and evaluation as the pixel RNNs and CNNs that we saw earlier, 436 00:48:48,378 --> 00:49:03,348 but which had, and then, also the VAE samples are tending to be a little bit blurrier and of lower quality compared to state of the art samples that we can see from other generative models such as GANs that we'll talk about next. 437 00:49:04,827 --> 00:49:08,647 And so VAEs now are still, they're still an active area of research. 438 00:49:11,044 --> 00:49:13,447 People are working on more flexible approximations, 439 00:49:13,447 --> 00:49:20,881 so richer approximate posteriors, so instead of just a diagonal Gaussian some richer functions for this. 440 00:49:20,881 --> 00:49:26,992 And then also, another area that people have been working on is incorporating more structure in these latent variables. 441 00:49:26,992 --> 00:49:31,282 So now we had all of these independent latent variables 442 00:49:31,282 --> 00:49:38,077 but people are working on having modeling structure in here, groupings, other types of structure. 443 00:49:41,106 --> 00:49:43,106 Okay, so yeah, question. 444 00:49:44,404 --> 00:49:47,529 [student's words obscured due to lack of microphone] 445 00:49:47,529 --> 00:49:51,394 Yeah, so the question is we're deciding the dimensionality of the latent variable. 446 00:49:51,394 --> 00:49:54,727 Yeah, that's something that you specify. 
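(Before moving on to GANs, a minimal code sketch of the VAE forward pass and loss discussed above, for the notes. The fully connected architecture, the dimensions, and the Bernoulli choice for the reconstruction term are illustrative assumptions rather than the lecture's exact setup; the sketch shows the encoder producing a mean and diagonal covariance, the reparameterization trick for differentiable sampling, the closed-form Gaussian KL term, and generation by sampling z from the prior and decoding.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal VAE sketch (sizes and loss details are illustrative assumptions).
class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=20, h_dim=400):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)        # mean of q(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)    # log diagonal covariance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))  # parameters of p(x|z)

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps keeps sampling differentiable.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_logits = self.dec(z)
        return x_logits, mu, logvar

def vae_loss(x, x_logits, mu, logvar):
    # Reconstruction term: log p(x|z), modeled here as a Bernoulli over pixels.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction='sum')
    # KL(q(z|x) || N(0, I)) has a closed form for two Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl   # negative lower bound; minimizing this maximizes the bound

model = VAE()
x = torch.rand(16, 784)              # a minibatch of (flattened) images in [0, 1]
loss = vae_loss(x, *model(x))
loss.backward()

# Generation after training: sample z from the prior and pass it through the decoder.
with torch.no_grad():
    z = torch.randn(16, 20)
    samples = torch.sigmoid(model.dec(z))
```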
447 00:49:55,874 --> 00:50:07,481 Okay, so we've talked so far about pixelCNNs and VAEs and now we'll take a look at a third and very popular type of generative model called GANs. 448 00:50:10,019 --> 00:50:15,713 So the models that we've seen so far, pixelCNNs and RNNs, define a tractable density function. 449 00:50:15,713 --> 00:50:19,752 And they optimize the likelihood of the training data. 450 00:50:19,752 --> 00:50:27,752 And then VAEs in contrast to that now have this additional latent variable Z that they define in the generative process. 451 00:50:27,752 --> 00:50:36,858 And so having the Z has a lot of nice properties that we talked about, but it also causes us to have this intractable density function that we can't 452 00:50:36,858 --> 00:50:43,934 optimize directly, and so we derive and optimize a lower bound on the likelihood instead. 453 00:50:43,934 --> 00:50:48,486 And so now what if we just give up on explicitly modeling this density at all? 454 00:50:48,486 --> 00:50:55,267 And we say well what we want is just the ability to sample and to have nice samples from our distribution. 455 00:50:56,501 --> 00:50:59,175 So this is the approach that GANs take. 456 00:50:59,175 --> 00:51:02,637 So in GANs we don't work with an explicit density function, 457 00:51:02,637 --> 00:51:05,642 but instead we're going to take a game-theoretic approach 458 00:51:05,642 --> 00:51:13,839 and we're going to learn to generate from our training distribution through a set up of a two player game, and we'll talk about this in more detail. 459 00:51:15,255 --> 00:51:24,681 So, in the GAN set up we're saying, okay, what we care about is we want to be able to sample from a complex high dimensional training distribution. 460 00:51:24,681 --> 00:51:31,170 So if we think about wanting to produce samples from this distribution, there's no direct way that we can do this. 461 00:51:31,170 --> 00:51:35,078 We have this very complex distribution, we can't just take samples from here. 462 00:51:35,078 --> 00:51:46,875 So the solution that we're going to take is that we can, however, sample from simpler distributions, for example random noise. Gaussians, these we can sample from. 463 00:51:46,875 --> 00:51:56,789 And so what we're going to do is we're going to learn a transformation from these simple distributions directly to the training distribution that we want. 464 00:51:58,790 --> 00:52:04,304 So the question is, what can we use to represent this complex transformation? 465 00:52:06,120 --> 00:52:07,718 Neural network, I heard the answer. 466 00:52:07,718 --> 00:52:14,373 So when we want to model some kind of complex function or transformation we use a neural network. 467 00:52:14,373 --> 00:52:23,297 Okay so what we're going to do in the GAN set up is we're going to take some input which is a vector of random noise of some dimension that we specify, 468 00:52:23,297 --> 00:52:33,628 and then we're going to pass this through a generator network, and then we're going to get as output directly a sample from the training distribution. 469 00:52:33,628 --> 00:52:40,154 So we want every input of random noise to correspond to a sample from the training distribution. 470 00:52:41,278 --> 00:52:48,737 And so the way we're going to train and learn this network is that we're going to look at this as a two player game. 471 00:52:48,737 --> 00:52:54,595 So we have two players, a generator network as well as an additional discriminator network that I'll show next.
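(A minimal sketch of that generator idea, with hypothetical dimensions and layers; the point is just that a neural network maps a random noise vector directly to an image-sized sample.)

    import torch
    import torch.nn as nn

    noise_dim, img_dim = 96, 28 * 28             # hypothetical dimensions

    # A small fully-connected generator: noise vector in, image-sized sample out.
    generator = nn.Sequential(
        nn.Linear(noise_dim, 256), nn.ReLU(),
        nn.Linear(256, img_dim), nn.Tanh(),      # outputs scaled to [-1, 1]
    )

    z = torch.rand(64, noise_dim) * 2 - 1        # a minibatch of random noise vectors
    fake_images = generator(z)                   # one generated sample per noise vector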
472 00:52:54,595 --> 00:53:04,320 And our generator network, as player one, is going to try to fool the discriminator by generating real looking images. 473 00:53:04,320 --> 00:53:12,462 And then our second player, our discriminator network, is then going to try to distinguish between real and fake images. 474 00:53:12,462 --> 00:53:23,323 So it wants to do as good a job as possible of trying to determine which of these images are counterfeit or fake images generated by this generator. 475 00:53:25,425 --> 00:53:27,324 Okay so what this looks like is, 476 00:53:27,324 --> 00:53:31,203 we have our random noise going into our generator network, 477 00:53:31,203 --> 00:53:36,121 the generator network is generating these images that we're going to call fake images. 478 00:53:36,121 --> 00:53:42,439 And then we're going to also have real images that we take from our training set and then we want the 479 00:53:42,439 --> 00:53:50,881 discriminator to be able to distinguish between real and fake images. 480 00:53:50,881 --> 00:53:52,849 It outputs real or fake for each image. 481 00:53:52,849 --> 00:54:01,638 So the idea is we want to train a good discriminator, and if it can do a good job of discriminating real versus fake, 482 00:54:01,638 --> 00:54:11,140 and if our generator network is able to do well and generate fake images that can successfully fool this discriminator, 483 00:54:11,140 --> 00:54:13,135 then we have a good generative model. 484 00:54:13,135 --> 00:54:17,431 We're generating images that look like images from the training set. 485 00:54:19,482 --> 00:54:25,548 Okay, so we have these two players and so we're going to train this jointly in a minimax game formulation. 486 00:54:25,548 --> 00:54:28,941 So this minimax objective function is what we have here. 487 00:54:28,941 --> 00:54:37,399 It's going to be a minimum over theta G, the parameters of our generator network G, 488 00:54:37,399 --> 00:54:44,848 and a maximum over theta D, the parameters of our discriminator network D, of this objective, right, these terms. 489 00:54:47,177 --> 00:54:49,624 And so if we look at these terms, what this is saying 490 00:54:49,624 --> 00:54:54,910 is, well, this first thing, expectation over the data of log of D of X. 491 00:54:56,094 --> 00:55:01,151 This log of D of X is the discriminator output for real data X. 492 00:55:01,151 --> 00:55:09,309 This is going to be the likelihood of real data being real, for data from the data distribution P data. 493 00:55:09,309 --> 00:55:16,882 And then the second term here, expectation of Z drawn from P of Z, so Z drawn from P of Z means samples of noise from our prior that go through 494 00:55:16,882 --> 00:55:27,577 our generator network, and this term D of G of Z that we have here is the output of our discriminator for the generated fake data, 495 00:55:29,109 --> 00:55:33,769 what the discriminator outputs for G of Z, which is our fake data. 496 00:55:36,311 --> 00:55:43,105 And so if we think about what this is trying to do, our discriminator wants to maximize this objective, right, 497 00:55:43,105 --> 00:55:53,278 it's a max over theta D such that D of X is close to one. It's close to real, it's high for the real data. 498 00:55:53,278 --> 00:56:02,679 And then D of G of Z, what it thinks of the fake data, is small, we want this to be close to zero.
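(Written out in symbols, the minimax objective being described is the standard GAN objective:)

    \min_{\theta_g} \max_{\theta_d} \;
      \Big[\; \mathbb{E}_{x \sim p_{\text{data}}} \big[ \log D_{\theta_d}(x) \big]
      \;+\; \mathbb{E}_{z \sim p(z)} \big[ \log \big( 1 - D_{\theta_d}(G_{\theta_g}(z)) \big) \big] \;\Big]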
499 00:56:02,679 --> 00:56:09,237 So if we're able to maximize this, it means the discriminator is doing a good job of distinguishing between real and fake. 500 00:56:09,237 --> 00:56:13,449 Basically classifying between real and fake data. 501 00:56:13,449 --> 00:56:22,375 And then our generator, here we want the generator to minimize this objective such that D of G of Z is close to one. 502 00:56:22,375 --> 00:56:35,236 So if this D of G of Z is close to one over here, then one minus D of G of Z is small, and so if we minimize this term, then we're having the 503 00:56:36,768 --> 00:56:39,175 discriminator think that our fake data's actually real. 504 00:56:39,175 --> 00:56:44,087 So that means that our generator is producing real-looking samples. 505 00:56:44,087 --> 00:56:51,139 Okay so this is the important objective of GANs to try and understand, so are there any questions about this? 506 00:56:51,139 --> 00:57:01,360 [student's words obscured due to lack of microphone] I'm not sure I understand your question, can you, [student's words obscured due to lack of microphone] 507 00:57:12,334 --> 00:57:23,067 Yeah, so the question is whether this is basically trying to have the first network produce real looking images that our second network, the discriminator, cannot distinguish from real ones. 508 00:57:30,474 --> 00:57:36,809 Okay, so the question is how do we actually label the data or do the training for these networks. 509 00:57:36,809 --> 00:57:46,180 We'll see how to train the networks next. But in terms of what the data label is, basically, this is unsupervised, so there's no data labeling. 510 00:57:46,180 --> 00:57:52,805 But data generated from the generator network, the fake images, have a label of basically zero or fake. 511 00:57:52,805 --> 00:58:00,344 And we can take training images that are real images and these basically have a label of one or real. 512 00:58:00,344 --> 00:58:04,866 So the loss function for our discriminator is using this. 513 00:58:04,866 --> 00:58:09,819 It's trying to output a zero for the generator images and a one for the real images. 514 00:58:09,819 --> 00:58:12,048 So there's no external labels. 515 00:58:12,048 --> 00:58:15,136 [student's words obscured due to lack of microphone] 516 00:58:15,136 --> 00:58:22,119 So the question is whether the label for the generator network will be the output of the discriminator network. 517 00:58:22,119 --> 00:58:29,321 The generator is not really doing classification. 518 00:58:29,321 --> 00:58:35,536 Its objective is here, D of G of Z, it wants this to be high. 519 00:58:35,536 --> 00:58:42,487 So given a fixed discriminator, it wants to learn the generator parameters such that this is high. 520 00:58:42,487 --> 00:58:47,752 So we'll take the fixed discriminator output and use that to do the backprop. 521 00:58:51,447 --> 00:58:54,219 Okay so in order to train this, what we're going to do 522 00:58:54,219 --> 00:58:57,714 is we're going to alternate between gradient ascent 523 00:58:57,714 --> 00:59:05,222 on our discriminator, so we're trying to learn theta D to maximize this objective. 524 00:59:05,222 --> 00:59:08,059 And then gradient descent on the generator. 525 00:59:08,059 --> 00:59:15,698 So taking gradient descent on these parameters theta G such that we're minimizing this objective. 526 00:59:15,698 --> 00:59:23,748 And here we are only taking this right part over here because that's the only part that's dependent on the theta G parameters.
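(As a rough sketch of that labeling point, with a hypothetical discriminator and random stand-in images: binary cross-entropy with label 1 for real and 0 for fake is one standard way to implement the two log terms of the discriminator's objective.)

    import torch
    import torch.nn as nn

    # Hypothetical discriminator: image in, probability of "real" out.
    discriminator = nn.Sequential(nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
                                  nn.Linear(256, 1), nn.Sigmoid())
    bce = nn.BCELoss()

    real_images = torch.rand(64, 28 * 28)        # stand-in for a minibatch from the training set
    fake_images = torch.rand(64, 28 * 28)        # stand-in for a minibatch from the generator

    real_labels = torch.ones(64, 1)              # "real" = 1
    fake_labels = torch.zeros(64, 1)             # "fake" = 0

    # Minimizing this is the same as maximizing log D(x) + log(1 - D(G(z))).
    d_loss = (bce(discriminator(real_images), real_labels)
              + bce(discriminator(fake_images), fake_labels))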
527 00:59:26,574 --> 00:59:30,603 Okay so this is how we can train this GAN. 528 00:59:30,603 --> 00:59:35,716 We can alternate between training our discriminator and our generator in this game, 529 00:59:35,716 --> 00:59:40,561 with the generator trying to fool the discriminator. 530 00:59:40,561 --> 00:59:50,478 But one thing that is important to note is that in practice this generator objective as we've just defined it actually doesn't work that well. 531 00:59:50,478 --> 00:59:55,309 And to see the reason for this we have to look at the loss landscape. 532 00:59:55,309 --> 01:00:01,059 So if we look at the loss landscape over here as a function of D of G of Z, 533 01:00:02,858 --> 01:00:10,654 if we plot here log of one minus D of G of Z, which is what we want to minimize for the generator, it has this shape here. 534 01:00:12,748 --> 01:00:21,119 So we want to minimize this and it turns out the slope of this loss is actually going to be higher towards the right. 535 01:00:21,119 --> 01:00:24,369 High when D of G of Z is closer to one. 536 01:00:26,915 --> 01:00:36,837 So that means that when our generator is doing a good job of fooling the discriminator, we're going to have a high gradient. 537 01:00:36,837 --> 01:00:44,794 And on the other hand when we have bad samples, when our generator hasn't learned to generate well yet, 538 01:00:44,794 --> 01:00:52,159 then this is when the discriminator can easily tell them apart, and we're closer to this zero region on the X axis. 539 01:00:53,002 --> 01:00:55,482 And here the gradient's relatively flat. 540 01:00:55,482 --> 01:01:03,977 And so what this actually means is that our gradient signal is dominated by the region where the samples are already pretty good. 541 01:01:05,200 --> 01:01:12,624 Whereas we actually want it to learn a lot when the samples are bad, right? These are the samples that we want to learn from. 542 01:01:12,624 --> 01:01:21,664 And so this basically makes it hard to learn, and so in order to improve learning, 543 01:01:21,664 --> 01:01:26,320 what we're going to do is define a slightly different objective function for the generator. 544 01:01:26,320 --> 01:01:30,145 Where now we're going to do gradient ascent instead. 545 01:01:30,145 --> 01:01:35,748 And so instead of minimizing the likelihood of our discriminator being correct, which is what we had earlier, 546 01:01:35,748 --> 01:01:40,908 now we'll kind of flip it and say let's maximize the likelihood of our discriminator being wrong. 547 01:01:40,908 --> 01:01:49,720 And so this will produce this objective here of maximizing log of D of G of Z. 548 01:01:50,767 --> 01:01:55,102 And so, now basically, there should be a negative sign here. 549 01:01:59,160 --> 01:02:08,659 But basically we want to now maximize this flipped objective instead, and what this does is, if we plot this function 550 01:02:10,118 --> 01:02:16,149 on the right here, then we have a high gradient signal in this region on the left where we have bad samples, 551 01:02:16,149 --> 01:02:23,242 and now the flatter region is to the right where we would have good samples. 552 01:02:23,242 --> 01:02:26,571 So now we're going to learn more from regions of bad samples.
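(A tiny sketch of the two generator losses, using made-up D(G(z)) scores, just to show the sign flip being described:)

    import torch

    d_fake = torch.tensor([0.01, 0.50, 0.99])    # made-up D(G(z)) scores for three samples

    saturating_loss = torch.log(1 - d_fake)      # what the original generator objective minimizes
    non_saturating_loss = -torch.log(d_fake)     # what the flipped objective minimizes

    # Near D(G(z)) = 0 (bad samples) the first loss is relatively flat, so the gradient is weak,
    # while the second loss is large and steep there, giving a strong learning signal.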
553 01:02:26,571 --> 01:02:35,990 And so this has the same objective of fooling the discriminator, but it actually works much better in practice, and a lot of work on GANs that 554 01:02:35,990 --> 01:02:41,492 uses this kind of vanilla GAN formulation is actually using this objective. 555 01:02:44,220 --> 01:02:59,079 Okay so just an aside on that is that jointly training these two networks is challenging and can be unstable. So as we saw here, we're alternating between training a discriminator and training a generator. 556 01:02:59,079 --> 01:03:08,398 This type of alternation, basically it's hard to learn two networks at once, and there's also this issue 557 01:03:08,398 --> 01:03:13,815 that depending on what our loss landscape looks like, it can affect our training dynamics. 558 01:03:13,815 --> 01:03:23,342 So an active area of research still is how can we choose objectives with better loss landscapes that can help training and make it more stable? 559 01:03:26,516 --> 01:03:31,152 Okay so now let's put this all together and look at the full GAN training algorithm. 560 01:03:31,152 --> 01:03:34,366 So what we're going to do is for each iteration of training 561 01:03:34,366 --> 01:03:41,078 we're going to first train the discriminator network a bit and then train the generator network. 562 01:03:41,078 --> 01:03:43,959 So for k steps of training the discriminator network, 563 01:03:43,959 --> 01:03:55,859 we'll sample a mini batch of noise samples from our noise prior over Z and then also sample a mini batch of real samples from our training data X. 564 01:03:57,366 --> 01:04:04,519 So what we'll do is we'll pass the noise through our generator, and we'll get our fake images out. 565 01:04:04,519 --> 01:04:08,052 So we have a mini batch of fake images and a mini batch of real images. 566 01:04:08,052 --> 01:04:15,041 And then we'll take a gradient step on the discriminator using this mini batch, our fake and our real images, 567 01:04:15,041 --> 01:04:17,891 and then update our discriminator parameters. 568 01:04:17,891 --> 01:04:24,313 And we'll do this a certain number of iterations to train the discriminator for a bit basically. 569 01:04:24,313 --> 01:04:28,803 And then after that we'll go to our second step which is training the generator. 570 01:04:28,803 --> 01:04:32,544 And so here we'll sample just a mini batch of noise samples. 571 01:04:32,544 --> 01:04:43,102 We'll pass this through our generator and then now we want to do backprop on this to basically optimize our generator objective that we saw earlier. 572 01:04:45,078 --> 01:04:49,705 So we want to have our generator fool our discriminator as much as possible. 573 01:04:50,773 --> 01:04:58,895 And so we're going to alternate between these two steps of taking gradient steps for our discriminator and for the generator. 574 01:04:59,996 --> 01:05:07,709 And I said for k steps up here for training the discriminator, and this is kind of a topic of debate. 575 01:05:08,604 --> 01:05:15,391 Some people think just having one step of the discriminator, then one step of the generator, is best. 576 01:05:15,391 --> 01:05:20,744 Some people think it's better to train the discriminator for a little bit longer before switching to the generator. 577 01:05:20,744 --> 01:05:30,732 There's no real clear rule, and people have found different things to work better depending on the problem.
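(Putting those steps together, here's a minimal sketch of the alternating training loop. The networks, optimizers, hyperparameters, and the random stand-in for real data are all hypothetical, and the generator step uses the non-saturating objective from above rather than the original one.)

    import torch
    import torch.nn as nn

    noise_dim, img_dim, k = 96, 28 * 28, 1       # hypothetical sizes; k = discriminator steps
    G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                      nn.Linear(256, img_dim), nn.Tanh())
    D = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                      nn.Linear(256, 1), nn.Sigmoid())
    opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    def real_minibatch():                        # stand-in for sampling real training images
        return torch.rand(64, img_dim) * 2 - 1

    for iteration in range(1000):
        # Step 1: k gradient steps on the discriminator.
        for _ in range(k):
            z = torch.rand(64, noise_dim) * 2 - 1
            fake = G(z).detach()                 # don't backprop into G on this step
            d_loss = (bce(D(real_minibatch()), torch.ones(64, 1))
                      + bce(D(fake), torch.zeros(64, 1)))
            opt_D.zero_grad()
            d_loss.backward()
            opt_D.step()

        # Step 2: one gradient step on the generator (non-saturating objective).
        z = torch.rand(64, noise_dim) * 2 - 1
        g_loss = bce(D(G(z)), torch.ones(64, 1))   # i.e. maximize log D(G(z))
        opt_G.zero_grad()
        g_loss.backward()
        opt_G.step()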
578 01:05:30,732 --> 01:05:45,028 And one thing I want to point out is that there's been a lot of recent work that alleviates this problem and makes it so you don't have to spend so much effort trying to balance the training of these two networks. 579 01:05:45,028 --> 01:05:47,880 It'll have more stable training and give better results. 580 01:05:47,880 --> 01:05:55,655 And so Wasserstein GAN is an example of a paper that was an important work towards doing this. 581 01:06:00,313 --> 01:06:09,767 Okay so looking at the whole picture, we have our network setup, we've trained both our generator network and our discriminator network, 582 01:06:09,767 --> 01:06:16,899 and now after training, for generation we can just take our generator network and use this to generate new images. 583 01:06:16,899 --> 01:06:21,520 So we just take noise Z and pass this through and generate fake images from here. 584 01:06:23,636 --> 01:06:28,351 Okay and so now let's look at some generated samples from these GANs. 585 01:06:28,351 --> 01:06:33,099 So here's an example trained on MNIST and then on the right one trained on Faces. 586 01:06:33,099 --> 01:06:43,849 And for each of these you can also see, just for visualization, the nearest neighbor from the training set shown next to the column of generated samples right beside it. 587 01:06:43,849 --> 01:06:49,227 And so you can see that we're able to generate very realistic samples, and the model isn't directly memorizing the training set. 588 01:06:51,264 --> 01:06:56,061 And here are some examples from the original GAN paper on CIFAR images. 589 01:06:56,061 --> 01:07:07,374 And these are not such good quality yet; the original work is from 2014, so these are some older, simpler networks. 590 01:07:07,374 --> 01:07:11,541 And these were using simple, fully connected networks. 591 01:07:12,550 --> 01:07:16,018 And so since that time there's been a lot of work on improving GANs. 592 01:07:18,120 --> 01:07:31,388 One example of a work that really took a big step towards improving the quality of samples is this work from Alec Radford in ICLR 2016 on adding convolutional architectures to GANs. 593 01:07:33,806 --> 01:07:42,958 In this paper there was a whole set of guidelines on architectures for helping GANs to produce better samples. 594 01:07:42,958 --> 01:07:46,517 So you can look at this for more details. 595 01:07:46,517 --> 01:07:52,669 This is an example of a convolutional architecture that they're using, which is going from our input 596 01:07:52,669 --> 01:07:57,694 noise vector Z and transforming this all the way to the output sample. 597 01:08:00,527 --> 01:08:08,251 So now from this large convolutional architecture we'll see that the samples from this model are really starting to look very good. 598 01:08:08,251 --> 01:08:11,408 So this is trained on a dataset of bedrooms 599 01:08:11,408 --> 01:08:15,575 and we can see all kinds of very realistic, fancy looking 600 01:08:16,783 --> 01:08:26,063 bedrooms with windows and night stands and other furniture around, so these are some really pretty samples. 601 01:08:26,064 --> 01:08:32,346 And we can also try and interpret a little bit of what these GANs are doing. 602 01:08:32,346 --> 01:08:42,817 So in this example here what we can do is we can take two points of Z, two different random noise vectors, and just interpolate between these points.
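(A minimal sketch of that interpolation experiment, with a hypothetical untrained generator standing in for a trained one: linearly interpolate between two noise vectors and generate an image at each point along the path.)

    import torch
    import torch.nn as nn

    noise_dim, img_dim = 96, 28 * 28
    G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                      nn.Linear(256, img_dim), nn.Tanh())   # stand-in for a trained generator

    z_start = torch.randn(noise_dim)             # first random noise vector
    z_end = torch.randn(noise_dim)               # second random noise vector

    # Walk from z_start to z_end in 10 steps and generate an image at each point.
    alphas = torch.linspace(0, 1, 10).unsqueeze(1)
    z_path = (1 - alphas) * z_start + alphas * z_end        # shape (10, noise_dim)
    row_of_images = G(z_path)                    # with a trained G these vary smoothly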
603 01:08:42,818 --> 01:08:50,142 And each row across here is an interpolation from one random noise vector Z to another random noise vector Z, 604 01:08:50,142 --> 01:08:57,072 and you can see that as Z changes, the image is smoothly interpolating as well, all the way across. 605 01:08:59,286 --> 01:09:02,067 And so something else that we can do is, 606 01:09:02,067 --> 01:09:10,313 well, let's try to analyze further what these vectors Z mean, and so we can try and do vector math here. 607 01:09:10,313 --> 01:09:17,828 So what this experiment does is it says okay, let's take some samples of 608 01:09:17,828 --> 01:09:26,628 smiling women images, and then let's take some samples of neutral women and then also some samples of neutral men. 609 01:09:28,341 --> 01:09:34,920 And so let's take the average of the Z vectors that produced each of these samples. 610 01:09:34,920 --> 01:09:45,037 Say we take the mean vector for the smiling women, subtract the mean vector for the neutral women and add the mean vector for the neutral men, what do we get? 611 01:09:46,651 --> 01:09:49,884 And we get samples of smiling men. 612 01:09:49,884 --> 01:09:56,200 So we can take the Z vector produced there, generate samples, and get samples of smiling men. 613 01:09:57,190 --> 01:10:03,879 And we can have another example of this. A man with glasses, minus a man without glasses, plus a woman without glasses. 614 01:10:05,918 --> 01:10:08,763 And we get a woman with glasses. 615 01:10:08,763 --> 01:10:18,358 So here you can see that basically the Z vectors have this type of interpretability, and you can use this to generate some pretty cool examples. 616 01:10:20,026 --> 01:10:23,967 Okay so this year, 2017, has really been the year of the GAN. 617 01:10:24,842 --> 01:10:33,261 There's been tons and tons of work on GANs and it's really sort of exploded and gotten some really cool results. 618 01:10:33,261 --> 01:10:38,680 So on the left here you can see people working on better training and generation. 619 01:10:38,680 --> 01:10:45,621 So we talked about improving the loss functions, more stable training, and this was able to get really nice 620 01:10:47,216 --> 01:10:50,173 generations here from different types of architectures, 621 01:10:50,173 --> 01:10:54,326 on the bottom here really crisp high resolution faces. 622 01:10:54,326 --> 01:11:01,742 With GANs there's also been models for source to target domain transfer and conditional GANs. 623 01:11:01,742 --> 01:11:08,363 And so here, this is an example of source to target domain transfer where, for example in the upper part 624 01:11:08,363 --> 01:11:14,703 here, we are trying to go from a source domain of horses to an output domain of zebras. 625 01:11:14,703 --> 01:11:25,813 So we can take an image of horses and train a GAN such that the output is going to be the same thing but now with zebras in the same image setting as the horses, 626 01:11:28,408 --> 01:11:33,124 and go the other way around. We can transform apples into oranges. 627 01:11:33,124 --> 01:11:38,608 And also the other way around. We can also use this to do photo enhancement. 628 01:11:38,608 --> 01:11:52,379 So taking a standard photo and trying to make it look as if you had taken it with a really nice expensive camera, so that you can get the nice blur effects.
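(And going back to the vector-arithmetic experiment from a moment ago, a minimal sketch of the idea, again with hypothetical stand-ins: average the z vectors behind each group of samples, do the arithmetic in latent space, and decode the result.)

    import torch
    import torch.nn as nn

    noise_dim, img_dim = 96, 64 * 64
    G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                      nn.Linear(256, img_dim), nn.Tanh())   # stand-in for a trained generator

    # Suppose we've picked a few z vectors whose samples fall into each group.
    z_smiling_woman = torch.randn(3, noise_dim)
    z_neutral_woman = torch.randn(3, noise_dim)
    z_neutral_man = torch.randn(3, noise_dim)

    # Average each group, do the arithmetic in latent space, then decode the result.
    z_new = (z_smiling_woman.mean(dim=0)
             - z_neutral_woman.mean(dim=0)
             + z_neutral_man.mean(dim=0))
    smiling_man = G(z_new)                       # with a trained G this tends to be a smiling man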
629 01:11:52,379 --> 01:12:03,750 On the bottom here we have scene changing, so transforming an image of Yosemite from the image in winter time to the image in summer time. 630 01:12:03,750 --> 01:12:05,753 And there's really tons of applications. 631 01:12:05,753 --> 01:12:16,373 So on the right here there's more. There's also going from a text description, having a GAN that's now conditioned on this text description, and producing an image. 632 01:12:18,343 --> 01:12:26,421 So there's something here about a small bird with a pink breast and crown and now we're going to generate images of this. 633 01:12:26,421 --> 01:12:37,383 And there's also examples down here of filling in edges. So conditioned on some sketch that we have, can we fill in a color version of what this would look like. 634 01:12:40,848 --> 01:12:50,416 Can we take a map grid, like Google Maps, and turn it into something that looks like Google Earth, 635 01:12:52,528 --> 01:12:56,767 going in and hallucinating all of these buildings and trees and so on. 636 01:12:56,767 --> 01:13:07,061 And so there's lots of really cool examples of this. And there's also this website for pix2pix which did a lot of these kinds of conditional GAN type examples. 637 01:13:08,077 --> 01:13:17,549 I encourage you to go look at it for more interesting applications that people have done with GANs. 638 01:13:17,549 --> 01:13:24,640 And in terms of research papers, there's also a huge number of papers about GANs this year now. 639 01:13:26,047 --> 01:13:31,365 There's a website called the GAN Zoo that is trying to compile a whole list of these. 640 01:13:31,365 --> 01:13:44,794 And here this has only taken me from A through C on the left and through about L on the right, so it won't even fit on the slide. There's tons of papers as well that you can look at if you're interested. 641 01:13:44,794 --> 01:13:57,376 And then one last pointer is also for tips and tricks for training GANs, here's a nice little website that has pointers if you're trying to train these GANs in practice. 642 01:14:01,313 --> 01:14:06,915 Okay, so summary of GANs. GANs don't work with an explicit density function. 643 01:14:06,915 --> 01:14:13,989 Instead we're going to represent this implicitly through samples, and they take a game-theoretic approach to training, 644 01:14:13,989 --> 01:14:18,973 so we're going to learn to generate from our training distribution through a two player game setup. 645 01:14:18,973 --> 01:14:26,212 And the pros of GANs are that they produce really gorgeous state of the art samples and you can do a lot with these. 646 01:14:26,212 --> 01:14:33,247 The cons are that they are trickier and more unstable to train, we're not just directly optimizing 647 01:14:36,499 --> 01:14:41,830 one objective function that we can just do backprop on and train easily. 648 01:14:41,830 --> 01:14:47,710 Instead we have these two networks that we're trying to balance the training of, so it can be a bit more unstable. 649 01:14:47,710 --> 01:14:57,629 And we also lose out on being able to do some of the inference queries, P of X, P of Z given X, that we had for example in our VAE. 650 01:14:57,629 --> 01:15:07,040 And GANs are still an active area of research, this is a relatively new type of model that we're starting to see a lot of and you'll be seeing a lot more of.
651 01:15:07,040 --> 01:15:20,633 And so people are still working now on better loss functions and more stable training, so Wasserstein GAN, for those of you who are interested, is basically an improvement in this direction 652 01:15:22,224 --> 01:15:31,489 that a lot of people are now using and basing models off of. There's also other work like LSGAN, Least Squares GAN, and others. 653 01:15:31,489 --> 01:15:39,307 So you can look into this more. And a lot of times for these new models, in terms of actually implementing them, they're not necessarily big changes. 654 01:15:39,307 --> 01:15:44,279 They're different loss functions that you can change a little bit and get a big improvement in training. 655 01:15:44,279 --> 01:15:51,500 And so some of these are worth looking into, and you'll also get some practice on your homework assignment. 656 01:15:51,500 --> 01:15:59,946 And there's also a lot of work on different types of conditional GANs and GANs for all kinds of different problem setups and applications. 657 01:16:01,648 --> 01:16:05,807 Okay so a recap of today. We talked about generative models. 658 01:16:05,807 --> 01:16:12,329 We talked about three of the most common kinds of generative models that people are using and doing research on today. 659 01:16:12,329 --> 01:16:17,588 So we talked first about pixelRNN and pixelCNN, which are explicit density models. 660 01:16:17,588 --> 01:16:26,981 They optimize the exact likelihood and they produce good samples, but they're pretty inefficient because of the sequential generation. 661 01:16:26,981 --> 01:16:35,090 We looked at VAEs, which optimize a variational lower bound on the likelihood, and this also produces a useful latent representation. 662 01:16:35,090 --> 01:16:40,305 You can do inference queries. But the sample quality is still not the best. 663 01:16:40,305 --> 01:16:47,657 So even though it has a lot of promise, it's still a very active area of research and has a lot of open problems. 664 01:16:47,657 --> 01:16:57,375 And then GANs, which we talked about, take a game-theoretic approach to training, and they currently achieve the best state of the art samples. 665 01:16:57,375 --> 01:17:05,047 But they can also be tricky and unstable to train, and they lose out a bit on the inference queries. 666 01:17:05,047 --> 01:17:10,239 And so what you'll also see is a lot of recent work on combinations of these kinds of models. 667 01:17:10,239 --> 01:17:12,733 So for example adversarial autoencoders. 668 01:17:12,733 --> 01:17:18,478 Something like a VAE trained with an additional adversarial loss on top, which improves the sample quality. 669 01:17:18,478 --> 01:17:32,444 There's also things like pixelVAE, which is a combination of pixelCNN and VAE, so there's a lot of combinations basically trying to take the best of all these worlds and put them together. 670 01:17:32,444 --> 01:17:40,449 Okay so today we talked about generative models and next time we'll talk about reinforcement learning. Thanks.